Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning
Authors
Abstract
Second-order optimization methods, notably the D-KFAC (Distributed Kronecker-Factored Approximate Curvature) algorithms, have gained traction for accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms must compute and communicate a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks of different DNN layers to different workers. DP-KFAC not only retains the convergence property of existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55×-1.65×, the communication cost by 2.79×-3.15×, and the memory footprint by 1.14×-1.47× in each second-order update compared to state-of-the-art D-KFAC methods. Our code is available at https://github.com/lzhangbv/kfac_pytorch.
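For intuition, here is a minimal, hypothetical PyTorch sketch of the distributed-preconditioning idea described in the abstract: each worker constructs and inverts the KFs only for the layers assigned to it, so the KFs themselves are never sent over the network; only the (much smaller) preconditioned gradients are exchanged. The function name, the round-robin layer assignment, the damping value, and all tensor shapes are our assumptions for illustration, not the paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.distributed as dist

def dp_kfac_precondition(grads, acts, grad_outs, damping=0.03):
    """Precondition per-layer gradients using locally constructed KFs.

    grads[l]     -- weight gradient of layer l, shape (out_l, in_l)
    acts[l]      -- layer inputs a_l, shape (batch, in_l)
    grad_outs[l] -- gradients w.r.t. pre-activations g_l, shape (batch, out_l)
    Assumes torch.distributed is initialized (dist.init_process_group).
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    result = [torch.zeros_like(g) for g in grads]
    for l, grad in enumerate(grads):
        if l % world != rank:  # round-robin layer-to-worker assignment (assumed)
            continue
        a, g = acts[l], grad_outs[l]
        # Kronecker factors built from local statistics only, so no KF
        # communication is needed.
        A = a.t() @ a / a.shape[0]  # A_l = E[a a^T]
        G = g.t() @ g / g.shape[0]  # G_l = E[g g^T]
        A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0], device=A.device))
        G_inv = torch.linalg.inv(G + damping * torch.eye(G.shape[0], device=G.device))
        # Preconditioned gradient: (G_l + λI)^{-1} ∇W_l (A_l + λI)^{-1}
        result[l] = G_inv @ grad @ A_inv
    # Non-owners hold zeros, so a sum all-reduce acts as a broadcast of each
    # owner's preconditioned gradient; only these, not the KFs, are exchanged.
    for p in result:
        dist.all_reduce(p)
    return result
```

This sketch only illustrates why distributing KF construction across workers eliminates KF communication; the paper's actual assignment strategy and communication schedule are more refined.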
Similar resources
Adaptive dropout for training deep neural networks
Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. This ‘adapt...
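As an illustration only, a small hypothetical PyTorch sketch of the adaptive (‘standout’) dropout idea described in this snippet: an overlaid network computes a per-unit keep probability, and units are stochastically zeroed with the complementary probability. The weight-sharing belief net and the alpha/beta scaling are our assumptions based on the description above, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class StandoutLayer(nn.Module):
    """Sketch of adaptive dropout: the keep probability of each hidden
    unit is computed by an overlaid network (here sharing the layer's
    own weights, rescaled by alpha and shifted by beta -- an assumption)."""
    def __init__(self, d_in, d_out, alpha=1.0, beta=0.0):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.alpha, self.beta = alpha, beta

    def forward(self, x):
        h = torch.relu(self.fc(x))
        p_keep = torch.sigmoid(self.alpha * self.fc(x) + self.beta)
        if self.training:
            mask = torch.bernoulli(p_keep)  # selectively zero unit activities
            return h * mask
        # At test time, scale by the expected keep probability instead.
        return h * p_keep

# Usage: layer = StandoutLayer(784, 256); y = layer(torch.randn(32, 784))
```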
Exploring Strategies for Training Deep Neural Networks
Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise u...
Distributed Newton Methods for Deep Neural Networks
Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this...
Scalable Bayesian Optimization Using Deep Neural Networks
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs sc...
A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets
Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NT...
Journal
Journal title: IEEE Transactions on Cloud Computing
Year: 2022
ISSN: 2168-7161, 2372-0018
DOI: https://doi.org/10.1109/tcc.2022.3205918